Add event tracing and ETDumps to executor_runner #5027

Merged
merged 22 commits into from
Jan 29, 2025

Conversation

benkli01
Collaborator

@benkli01 benkli01 commented Sep 2, 2024

  • Enabled via EXECUTORCH_ENABLE_EVENT_TRACER
  • Add flag 'etdump_path' to specify the file path for the ETDump file
  • Add flag 'num_executions' for number of iterations to run
  • Create and pass event tracer 'ETDumpGen'
  • Save ETDump to disk
  • Update docs to reflect the changes

Re-upload of #4502 to discuss with @GregoryComer.

pytorch-bot bot commented Sep 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5027

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 662cb81 with merge base 7bc06d1 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Sep 2, 2024
@benkli01
Collaborator Author

benkli01 commented Sep 2, 2024

@pytorchbot label 'partner: arm'

@pytorch-bot added the partner: arm label Sep 2, 2024
@benkli01
Collaborator Author

benkli01 commented Sep 2, 2024

@pytorchbot label ciflow/trunk

pytorch-bot bot commented Sep 2, 2024

Can't add following labels to PR: ciflow/trunk. Please ping one of the reviewers for help.

@benkli01
Collaborator Author

benkli01 commented Sep 4, 2024

Hi @GregoryComer. Would it be possible to run the CI on your side to see if the issue from the previous PR is still occurring? I'm having a hard time understanding where this comes from.

@facebook-github-bot
Contributor

@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

- Enabled via EXECUTORCH_ENABLE_EVENT_TRACER
- Add flag 'etdump_path' to specify the file path for the ETDump file
- Add flag 'num_executions' for number of iterations to run
- Create and pass event tracer 'ETDumpGen'
- Save ETDump to disk
- Update docs to reflect the changes

Signed-off-by: Benjamin Klimczak <benjamin.klimczak@arm.com>
Change-Id: I7e8e8b7f21453bb8d88fa2b9c2ef66c532f3ea46
@benkli01 force-pushed the add-profiling-to-xnn-executor-runner-2 branch from 3288eda to b09d09e September 23, 2024 09:48
@benkli01
Collaborator Author

Hi @dbort . Sorry for dragging you into this, but I saw your comment on EXECUTORCH_SEPARATE_FLATCC_HOST_PROJECT in the code, so I thought you might be able to help with resolving the failing test here. Any idea how to fix this?

@digantdesai
Contributor

I don't see a CI failure anymore

@benkli01
Collaborator Author

I don't see a CI failure anymore

@digantdesai For me, pull / test-llama-runner-qnn-linux (fp32, cmake, qnn) / linux-job (pull_request) shows up as failing after my latest update. The CI run for the previous version you imported did not finish for me, so I could not see any results, but it did not seem to include this test anyway.

@facebook-github-bot
Contributor

@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@digantdesai
Contributor

Yeah, I see it now. It's from the main CMakeLists.txt, and qnn does have EXECUTORCH_ENABLE_EVENT_TRACER=ON.

@digantdesai
Contributor

Any update on this?

@benkli01
Collaborator Author

Hi @digantdesai, I'm still hoping for some pointer from @dbort or you as I'm struggling to reproduce it locally and can't really make sense of the error.

@benkli01 added the release notes: examples label Nov 25, 2024
@freddan80
Collaborator

@digantdesai will you have a look at this one since it touches code outside arm delegate. Thx!

@cccclai
Contributor

cccclai commented Nov 30, 2024

The error shows up when running this script https://github.com/pytorch/executorch/blob/main/backends/qualcomm/scripts/build.sh based on the log.

If you have a linux machine, can you follow https://pytorch.org/executorch/stable/build-run-qualcomm-ai-engine-direct-backend.html and see if the script fails?

@benkli01
Collaborator Author

benkli01 commented Dec 2, 2024

@cccclai I finally managed to reproduce the issue by running the script backends/qualcomm/scripts/build.sh with the parameters from the CI script here. Interestingly, the issue seems to be caused by --job_number 2, which was not my first guess. If I remove the parameter entirely, so that it defaults to --job_number 16, the issue disappears, though I'm not sure this is an acceptable solution or would work in the CI. I'm guessing that this is related to the TODO here. Any input on how to proceed would be much appreciated.

- Raise a CMake error if event tracing is enabled without the devtools
- Re-factoring of the changes in the portable executor_runner
- Minor fix in docs

Change-Id: Ia50fef8172f678f9cbe2b33e2178780ff983f335
Signed-off-by: Benjamin Klimczak <benjamin.klimczak@arm.com>
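
The first item in the commit message above, raising a CMake error when event tracing is enabled without the devtools, could look roughly like the following. This is a minimal standalone sketch, not the block from the PR: the two option names come from this conversation, while the project name, the option defaults, and the error text are assumptions made for illustration.

```cmake
cmake_minimum_required(VERSION 3.19)
# LANGUAGES NONE keeps the sketch configurable without a compiler.
project(event_tracer_guard_sketch LANGUAGES NONE)

# Option names are taken from the discussion; the OFF defaults are assumed.
option(EXECUTORCH_ENABLE_EVENT_TRACER "Build with event tracing" OFF)
option(EXECUTORCH_BUILD_DEVTOOLS "Build the devtools (ETDump)" OFF)

# Report the inconsistent combination at configure time, so it does not
# surface later as an undefined reference at link time.
if(EXECUTORCH_ENABLE_EVENT_TRACER AND NOT EXECUTORCH_BUILD_DEVTOOLS)
  message(SEND_ERROR
    "EXECUTORCH_ENABLE_EVENT_TRACER=ON requires EXECUTORCH_BUILD_DEVTOOLS=ON"
  )
endif()
```

Configuring this sketch with only -DEXECUTORCH_ENABLE_EVENT_TRACER=ON should report the error, while adding -DEXECUTORCH_BUILD_DEVTOOLS=ON should let configuration pass.
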
Collaborator Author

@benkli01 benkli01 left a comment

Thanks for the review! All issues should be fixed now.

examples/portable/executor_runner/executor_runner.cpp (outdated, resolved)
examples/portable/executor_runner/executor_runner.cpp (outdated, resolved)
examples/portable/executor_runner/executor_runner.cpp (outdated, resolved)
docs/source/tutorial-xnnpack-delegate-lowering.md (outdated, resolved)
CMakeLists.txt (outdated, resolved)
backends/xnnpack/CMakeLists.txt (outdated, resolved)
Contributor

@dbort dbort left a comment

Thanks for refactoring executor_runner, it looks great. The remaining issue is the flatccrt dependency.

CMakeLists.txt (outdated, resolved)
@facebook-github-bot
Contributor

@dbort has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@dbort
Contributor

dbort commented Jan 22, 2025

Thanks for updating the dependency, and again I apologize for how long this has taken me to review. I'm running internal tests now and should be able to merge this soon.

@benkli01
Collaborator Author

Thanks @dbort ! There is a linker error remaining in the qnn tests that I could not reproduce locally and I don't understand where it is coming from. Maybe you have an idea...

Contributor

@dbort dbort left a comment

You're right, I see the failure
https://github.com/pytorch/executorch/actions/runs/12866353984/job/35868756676?pr=5027#step:14:34489

/usr/bin/ld: CMakeFiles/executor_runner.dir/examples/portable/executor_runner/executor_runner.cpp.o: in function `main':
executor_runner.cpp:(.text.main+0x5f9): undefined reference to `executorch::etdump::ETDumpGen::ETDumpGen(executorch::runtime::Span<unsigned char>)'
/usr/bin/ld: executor_runner.cpp:(.text.main+0x95d): undefined reference to `executorch::etdump::ETDumpGen::get_etdump_data()'

The build output says "EXECUTORCH_BUILD_DEVTOOLS : OFF", but the linker error implies that ET_EVENT_TRACER_ENABLED is defined.

It does seem like the block you added to the top-level CMakeLists.txt should have triggered a SEND_ERROR in this case.

For debugging, it would help to print EXECUTORCH_ENABLE_EVENT_TRACER in Utils.cmake.

And to reproduce, you might be able to use the cmake command from the log:
https://github.com/pytorch/executorch/actions/runs/12866353984/job/35868756676?pr=5027#step:14:33817

cmake -DCMAKE_INSTALL_PREFIX=cmake-out -DCMAKE_BUILD_TYPE=Release -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON -DEXECUTORCH_BUILD_KERNELS_CUSTOM=OFF -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON -DEXECUTORCH_BUILD_XNNPACK=OFF -DEXECUTORCH_BUILD_MPS=OFF -DEXECUTORCH_BUILD_COREML=OFF -DEXECUTORCH_BUILD_QNN=ON -DQNN_SDK_ROOT=/tmp/qnn/2.28.0.241029 -DPYTHON_EXECUTABLE=python -Bcmake-out .

cmake --build cmake-out -j9 --target install --config Release

rather than waiting for the full CI to run. Could try removing the -DQNN_SDK_ROOT=/tmp/qnn/2.28.0.241029 part since it doesn't seem like it would affect this failure.
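
The debugging step suggested above, printing the flag during configuration, might be sketched as follows. Where exactly it would go in Utils.cmake is not stated in this thread, and printing EXECUTORCH_BUILD_DEVTOOLS alongside it is an addition made here for illustration.

```cmake
# Sketch: print the options that decide whether ET_EVENT_TRACER_ENABLED
# and the devtools end up in the same build.
foreach(flag EXECUTORCH_ENABLE_EVENT_TRACER EXECUTORCH_BUILD_DEVTOOLS)
  if(DEFINED ${flag})
    # ${${flag}} expands to the value of the variable named by `flag`.
    message(STATUS "${flag}: ${${flag}}")
  else()
    message(STATUS "${flag}: <not set>")
  endif()
endforeach()
```

The fragment uses only `foreach`, `if(DEFINED ...)`, and `message(STATUS ...)`, so it can be pasted into any CMake file that is processed after the options are declared.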

- Remove explicit addition of `-DET_EVENT_TRACER_ENABLED` from
  backends/qualcomm/CMakeLists.txt as setting the definition without
  enabling cmake flag `EXECUTORCH_ENABLE_EVENT_TRACER` caused issues
  when building the executor_runner.
- Replace deprecated namespace `torch::executor` with
  `executorch::etdump` in the executor_runner.cpp.

Signed-off-by: Benjamin Klimczak <benjamin.klimczak@arm.com>
Change-Id: Iadff38374e661f42e394dc69903548922ca08aea
@benkli01
Collaborator Author

Thanks @dbort ! I managed to find and fix the issue by following your pointers a little further. I think the remaining CI failures are not related to my change.

The problem was that the QNN backend always set the definition ET_EVENT_TRACER_ENABLED without taking into account the status of CMake flag EXECUTORCH_ENABLE_EVENT_TRACER. I.e. when building with EXECUTORCH_BUILD_QNN=ON the definition ET_EVENT_TRACER_ENABLED was always enabled at compile time, but the needed libs might not have been linked depending on the CMake flags.

The fix: I removed the definition in the QNN backend here so that the behavior should now be fully controlled by the CMake flags.
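
The idea behind the fix, that the compile definition should follow the CMake flag instead of being set unconditionally by a backend, can be sketched as below. This is an illustration under assumptions, not the code in the main CMakeLists.txt: `executorch_sketch` is a hypothetical stand-in for the real `executorch` target, and the project name and option default are made up for the example.

```cmake
cmake_minimum_required(VERSION 3.19)
project(event_tracer_definition_sketch LANGUAGES NONE)

# Option name is taken from the discussion; the OFF default is assumed.
option(EXECUTORCH_ENABLE_EVENT_TRACER "Build with event tracing" OFF)

# Hypothetical stand-in for the real `executorch` target.
add_library(executorch_sketch INTERFACE)

# The macro is defined only when the option is ON, so a backend such as
# QNN never turns it on by itself while the devtools are left unlinked.
if(EXECUTORCH_ENABLE_EVENT_TRACER)
  target_compile_definitions(
    executorch_sketch INTERFACE ET_EVENT_TRACER_ENABLED
  )
endif()
```

With this shape, any target that links against the stand-in sees ET_EVENT_TRACER_ENABLED exactly when the option is enabled, which is the behavior the comment above describes as "fully controlled by the CMake flags".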

@benkli01 benkli01 requested a review from dbort January 27, 2025 09:34
#
# add compile option
#
target_compile_options(executorch PUBLIC -DET_EVENT_TRACER_ENABLED)
Contributor

@shewu-quic you added this line in https://github.com/pytorch/executorch/pull/2227/files#diff-0f6d37c62838592cf30db121c46cf34a9b9316df4b5a0d98f0ad7c7a98f7ff7eR261

It's currently causing problems because ET_EVENT_TRACER_ENABLED only works when EXECUTORCH_BUILD_DEVTOOLS is also enabled. Is it ok to remove this, or will other scripts/docs need to change?

Collaborator Author

It should be automatically set by the main CMakeLists.txt as long as the CMake flag EXECUTORCH_ENABLE_EVENT_TRACER is enabled.

@dbort
Contributor

dbort commented Jan 28, 2025

@benkli01 Thank you for tracking down the problem with the Qualcomm jobs! @shewu-quic @cccclai please let us know if the change to qualcomm/CMakeLists.txt is ok.

@facebook-github-bot
Contributor

@dbort has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor

@dbort dbort left a comment

Since the CI jobs look good, I'm fine with merging this. Thanks so much for taking the time to figure all of this out!

@dbort dbort merged commit 282c137 into pytorch:main Jan 29, 2025
46 checks passed